Duplicate detection in the Reuters collection
Abstract
In a bibliographic database, the main task is not to find exact duplicate records but to find those that refer to the same work yet differ in some manner. Differences are typically due to inaccurate or inconsistent data entry. One method for detecting such duplicates was developed by Ridley [Ridley 92], who adopted a two-stage technique. First, every record in the database was assigned a number generated by a hashing function that took fields of the bibliographic record as its input. In the second stage, any records that shared the same hash number were examined in greater detail. This entailed comparing fields using customised processes: for example, the author field process looked for missing initials, and the title field process looked for a missing suffix. Detection techniques of this kind are supported by the work of O’Neill et al. [O’Neill 93], who manually examined duplicate bibliographic records to find which fields were most likely to differ.
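The two-stage idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Ridley's actual procedure: the abstract does not say which fields feed the hash or how the field comparisons work, so the record layout (`author`, `title`, `year`), the key built from a surname prefix and year, and the helpers `blocking_key`, `authors_match` and `titles_match` are all hypothetical.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Stage 1: derive a number from selected fields of a record.

    Assumption: the key uses the first four letters of the surname plus
    the year. The source does not specify the fields or hash function.
    """
    surname = record["author"].split(",")[0].strip().lower()
    basis = f"{surname[:4]}|{record['year']}"
    return zlib.crc32(basis.encode("utf-8"))


def authors_match(a, b):
    """Hypothetical author check that tolerates missing initials."""
    surname_a, _, initials_a = a.partition(",")
    surname_b, _, initials_b = b.partition(",")
    if surname_a.strip().lower() != surname_b.strip().lower():
        return False
    ia = initials_a.replace(".", "").replace(" ", "").lower()
    ib = initials_b.replace(".", "").replace(" ", "").lower()
    # "J" is accepted against "JA": one set of initials is a prefix of the other.
    return ia.startswith(ib) or ib.startswith(ia)


def titles_match(a, b):
    """Hypothetical title check that tolerates a missing suffix."""
    ta = " ".join(a.lower().split())
    tb = " ".join(b.lower().split())
    if not ta or not tb:
        return False
    return ta.startswith(tb) or tb.startswith(ta)


def find_duplicates(records):
    """Return index pairs of records judged to describe the same work."""
    # Stage 1: group record indices by their hash number.
    buckets = defaultdict(list)
    for index, record in enumerate(records):
        buckets[blocking_key(record)].append(index)

    # Stage 2: compare fields in detail, but only within each group.
    pairs = []
    for indices in buckets.values():
        for i, j in combinations(indices, 2):
            if authors_match(records[i]["author"], records[j]["author"]) and \
               titles_match(records[i]["title"], records[j]["title"]):
                pairs.append((i, j))
    return pairs
```

The point of the first stage is cost: the expensive field-by-field comparison runs only on records that share a hash number, rather than on every pair in the database. The trade-off is that two records whose hashed fields differ are never compared, so the choice of input fields matters.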
Similar resources
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
Accuracy and validity of data are prerequisites for the correct operation of any software system. There is always a possibility of errors in data due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. There should be only one of them in a data source, but for some reasons like aggregation of ...
Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection
There are many contexts where the automated detection of near-duplicate images is important, for example the detection of copyright infringement or images of child abuse. There are many published methods for the detection of similar and near-duplicate images; however it is still uncommon for methods to be objectively compared with each other, probably because of a lack of any good framework in ...
Reuters test collection Saturday, 11 June, 1994
This short paper presents the little known Reuters 22,173 test collection, which is significantly larger than most traditional test collections. In addition, Reuters has none of the recall calculation problems normally associated with some of the larger test collections now available. This paper explains the method (derived from Lewis [Lewis 91]) used to perform retrieval experiments on the Reu...
Categorizing Gigabytes: Experiments on the RCV1 Corpus
This paper presents categorization results performed by means of HITEC categorizer tool on the new benchmark document collection of text categorization, the Reuters Corpus Volume 1 (RCV1). RCV1 is an archive of over 800,000 manually categorized newswire stories made available by Reuters in 2000 for research purposes. This collection was released to take place of the Reuters-21578 collection tha...
Publication date: 1997